Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec  5 09:34:07 2022"

The text continues here.

I hope everything is now installed correctly. Learning Git/Github seems useful and it is something that should prove to be useful in the future. I am a bit nervous about using R as I’m not very familar with it but we’ll see how this goes.

Thoughts about Exercise set 1

R really seems like a good tool for building graphs and plots and the logic seems pretty straightforward. However, it will take time to learn the syntax and all the commands. It is inevitable that learning new software feels at first like an avalanche of information, but this tutorial for R was nice in that it teaches one thing at a time.

my GitHub repository


Chapter 2) Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec  5 09:34:07 2022"

1) Reading data

lrn14 <- read.table("learning2014.csv", sep=",", header=TRUE)

# dimensions of the data
dim(lrn14)
## [1] 166   7
# structure of the data
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This is survey data from 2014 of Approaches to learning.This is a subset of original learning2014 data, from which 7 variables were picked:gender, age, attitude, deep, stra, surf and points. ‘Deep’(deep learning), ‘stra’ (strategic learning) and ‘surf’ (superficial learning) have been combined from their related items. You can find more information about the dataset from learning2014

2) Overview of data

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Looking at the data, there are more females than males. Age is the only variable that is clearly skewed (to the left), while other variables are more evenly distributed. Srongest correlations are found between deep learning and surface learning (-0.32), attitude and point (0.44). Additionally, surface leaning was correlated with strategic learning (-0.16) and attitude (-0.18). All these aforementioned correlations were in same direction in both sexes.

3) and 4) Linear regression model

model_1 <- lm(points ~ attitude + stra + surf, data = lrn14)

summary(model_1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
model_2 <- lm(points ~ attitude, data = lrn14)
summary(model_2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Initially, I chose attitude, strategic learning and surface learning as explanatory variables for points in linear regression model as they were the variables showing highest correlation with points. However, attitude was the only statistically significant predictor of points (Coeff 0.34, p-value <0.001), while the model explained 20% of variation in points (Adjusted R squared). After removing non significant variables from the model, attitude predicted points with Coeff 0.35 (p-value <0.001), while the model explained 19% of variation in points. Taken together, based on this model, attitude was the best predictor of points: better attitude predicted more points.

5) Diagnostic plots

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2))
plot(model_2, which = c(1, 2, 5))

The main assumptions of Linear regression model are:
* linear relationship between response value and eplanatory value
* errors (residuals) have constant variance across all values of eplanatory variable
* errors are independent of each other
* errors have a normal distribution

Residuals vs fitted should not show a pattern where distribution of residuals varies along fitted values. In the figure above residuals are nicely evenly distributed.
QQ-plot should show a line if residuals are normally distributed. Figure above shows that the data is pretty much following the line excluding few outliers.
Residuals vs leverage should not show points outside Cook’s distance, which holds for the figure above.
Taken together, the assumptions of linear regression model hold true for the current model.


Chapter 3) Logistic regrsession

date()
## [1] "Mon Dec  5 09:34:14 2022"

Reading data

The data consists of student performance data in two Portuguese schools, collected by reports and questionnaires. It consists of two datasets from two different subjects: math and Portuguese. These datasets were combined and the variables of the combined dataset are printed below. In addition, two variables were calculated: alc_use is the average alcohol use from weekdays and weekends and high_use tells is true when alc_use is over 2

More information (including variable information) and the original dataset can be found from UCI_machine_learning_repository

alc <- read.csv("C:/Users/labpaavo/IODS-project/data/alc.csv")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Variables of interest in relation to alcohol consumption and hypotheses

For the variables of interest in relation to high/low alcohol consumption I have chosen: sex, Pstatus, studytime and absences. The hypotheses are that:
- being a male predicts high alcohol consumption
- parents living apart predicts high alcohol consumption
- low studytime predicts high alcohol consumption
- high absences predicts high alcohol consumption

Exploring the chosen variables

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)

library(descr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

my_variables = c("high_use", "sex", "Pstatus", "studytime", "absences")
my_data <- alc[my_variables]

summary(my_data)
##   high_use           sex              Pstatus            studytime    
##  Mode :logical   Length:370         Length:370         Min.   :1.000  
##  FALSE:259       Class :character   Class :character   1st Qu.:1.000  
##  TRUE :111       Mode  :character   Mode  :character   Median :2.000  
##                                                        Mean   :2.043  
##                                                        3rd Qu.:2.000  
##                                                        Max.   :4.000  
##     absences     
##  Min.   : 0.000  
##  1st Qu.: 1.000  
##  Median : 3.000  
##  Mean   : 4.511  
##  3rd Qu.: 6.000  
##  Max.   :45.000
CrossTable(my_data$high_use, my_data$sex)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================
##                     my_data$sex
## my_data$high_use        F       M   Total
## -----------------------------------------
## FALSE                 154     105     259
##                     2.244   2.500        
##                     0.595   0.405   0.700
##                      0.79    0.60        
##                     0.416   0.284        
## -----------------------------------------
## TRUE                   41      70     111
##                     5.235   5.833        
##                     0.369   0.631   0.300
##                      0.21    0.40        
##                     0.111   0.189        
## -----------------------------------------
## Total                 195     175     370
##                     0.527   0.473        
## =========================================
CrossTable(my_data$high_use, my_data$Pstatus)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================
##                     my_data$Pstatus
## my_data$high_use        A       T   Total
## -----------------------------------------
## FALSE                  26     233     259
##                     0.014   0.002        
##                     0.100   0.900   0.700
##                     0.684   0.702        
##                     0.070   0.630        
## -----------------------------------------
## TRUE                   12      99     111
##                     0.032   0.004        
##                     0.108   0.892   0.300
##                     0.316   0.298        
##                     0.032   0.268        
## -----------------------------------------
## Total                  38     332     370
##                     0.103   0.897        
## =========================================
CrossTable(my_data$high_use, my_data$studytime)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## =========================================================
##                     my_data$studytime
## my_data$high_use        1       2       3       4   Total
## ---------------------------------------------------------
## FALSE                  56     128      52      23     259
##                     2.314   0.017   2.381   0.889        
##                     0.216   0.494   0.201   0.089   0.700
##                     0.571   0.692   0.867   0.852        
##                     0.151   0.346   0.141   0.062        
## ---------------------------------------------------------
## TRUE                   42      57       8       4     111
##                     5.400   0.041   5.556   2.075        
##                     0.378   0.514   0.072   0.036   0.300
##                     0.429   0.308   0.133   0.148        
##                     0.114   0.154   0.022   0.011        
## ---------------------------------------------------------
## Total                  98     185      60      27     370
##                     0.265   0.500   0.162   0.073        
## =========================================================
my_data %>% group_by(high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 2 × 3
##   high_use count mean_absences
##   <lgl>    <int>         <dbl>
## 1 FALSE      259          3.71
## 2 TRUE       111          6.38
g1 <-  ggplot(data = my_data, aes(x = sex))
g1 + geom_bar() + facet_wrap("high_use") 

g2 <-  ggplot(data = my_data, aes(x = Pstatus))
g2 + geom_bar() + facet_wrap("high_use") 

g3 <-  ggplot(data = my_data, aes(x = studytime))
g3 + geom_bar() + facet_wrap("high_use") 

g4 <-  ggplot(data = my_data, aes(x = absences))
g4 + geom_bar() + facet_wrap("high_use") 

Looking at data, it seems that first of all there are more students with low alcohol consumption (259) compared to high- (111). Absences is clearly skewed to left (less absences). It seems that high consumption group has more males (63%) compared to low consumption (41%). Parental status seems to be similarly distributed between the grpoups. In regards to studytime, high consumption group has notably higher frequency of students who study very little (37.8% vs 21.6%). Finally, high consumption group has higher mean absences (6.4), compared to low consumption group (3.7). Taken together, it seems that sex, studytime and absences could maybe predict alcohol consumptions, whereas parental status seems to be pretty evenly distributed between groups and thus will unlikely predict alcohol consumption.

Logistic regression

m <- glm(high_use ~ sex + Pstatus + studytime + absences, data = my_data, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + Pstatus + studytime + absences, 
##     family = "binomial", data = my_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1339  -0.8389  -0.5966   1.0728   2.1221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.01835    0.55916  -1.821 0.068574 .  
## sexM         0.84663    0.25509   3.319 0.000903 ***
## PstatusT     0.12379    0.40420   0.306 0.759415    
## studytime   -0.41526    0.16023  -2.592 0.009550 ** 
## absences     0.08928    0.02345   3.807 0.000141 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 408.40  on 365  degrees of freedom
## AIC: 418.4
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3611909 0.1172496 1.0606103
## sexM        2.3317700 1.4202152 3.8687722
## PstatusT    1.1317735 0.5251790 2.5945605
## studytime   0.6601657 0.4776116 0.8967295
## absences    1.0933826 1.0465506 1.1475274

Logistic regression models showed that sex, studytime and absences were significant predictors high/low alcohol consumptions:
sexM log odds .85 (p-value <0.001) studytime log odds -.42 (p-value < 0.01), meaning that for unit increase in studytime the log odds for being n high alcohol consumption decrease by 0.42.
absences log odds .09 (p-value <0.001), meaning that for each absence the log odds for high alcohol consumption changes by 0.09.

The odds for a male being in high alcohol consumption groups over a female is 2.3 (CI 1.4; 3.9). For unit increase in studytime the odds of being in high alcohol consumption group decrese by 34% (CI 0.48; 0.90), note that studytime had four possible values corresponging to <2 hours, 2 to 5 hours, 5 to 10 hours and >10 hours. For each absence the odds of being in high alcohol consumption group increase by 9.3% (CI 4.7%; 14.8%).

Against the initial hypothesis parental status did not predict alcohol consumption, however this was could be predicted already by exploring the data. The effect of sex, studytime and absences were as hypothesized as being a male, lower studytime and higher absences predicted being in high alcohol consumption group.

Predictive power of the model

# drop parental status from the model as it wasn't statistically significant predictor
new_model <- glm(high_use ~ sex + studytime + absences, data = my_data, family = "binomial")

probabilities <- predict(new_model, type = "response")
my_data <- mutate(my_data, probability = probabilities)
my_data <- mutate(my_data, prediction = (probability > 0.5))

table(high_use = my_data$high_use, prediction = my_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250    9
##    TRUE     86   25
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# training error 
loss_func(class = my_data$high_use, prob = my_data$probability)
## [1] 0.2567568
g <- ggplot(my_data, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

The model resulted in 250 true negatives, 25 true positives, 86 false negatives and 9 false positives.
The training error of the model is 25.7%.

10-fold cross-validation

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(class = my_data$high_use, prob = my_data$probability)
## [1] 0.2567568
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2567568

The test set error of my model with 10-fold cross-validation is ~0.27, which is slightly worse than the model introduced in Exercise set.

Chapter 4) Clustering

date()
## [1] "Mon Dec  5 09:34:17 2022"

Reading data and exploring it

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(ggplot2)
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Boston data consists of 14 variables (and 506 observations) and it is about housing values in suburbs of Boston. Details and full descriptions of the variables can be found from Boston.

Graphical overview of the data

#summaries of the variable data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

#library(Hmisc)
boston_df <- as.data.frame(Boston)
#hist.data.frame(boston_df) this plot gives an error while knitting the index file so unfortunately can't display it


cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The figure of pairs function was so difficult to read on my laptop, so I ended up drawing also histograms to look at the distributions of the variables (unfortunately it can’t be displayed while knitting index because of the plot size, code commented out). Many variables are heavily skewed. The only variables close to normal distribution seem to be ‘average number of rooms per dwelling’(rm) and ‘median value of owner-occupied homes in $1000s’ (medv).

The correlation plot shows the relationships between the variables. Strongest negative correlations can be found between:
* weighted mean of distances to five Boston employment centres (dis) and proportion of owner-occupied units built prior to 1940 (age)
* dis and nitrogen oxides concentration (nox)
* dis and proportion of non-retail business acres per town (indus)
* lower status of the population percent (lastat) and median value of owner-occupied homes in $1000s (medv)

Strongest positive correlation is between index of accessibility to radial highways (rad) and full-value property-tax rate per $10,000 (tax)

Standardize the dataset

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))

Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "black", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 1)

Predict classes with LDA model

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) 
##           predicted
## correct    low med_low med_high high
##   low       19       8        0    0
##   med_low    5      18        2    0
##   med_high   2      10       15    0
##   high       0       0        0   23

LDA model seems to predict classes quite well with accuracy of ~75% (varying between seeds). The model performed best in classifying high crime rates.

k-means clustering

# reloading the dataset
library(MASS)
data("Boston")

# standardizing the dataset
boston_scaled2 <- as.data.frame(scale(Boston))

# distances between observations
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Looking for optimal number of clusters

library(ggplot2)
set.seed(13)

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The optimal number of clusters is when the value of total WCSS (y-axis) changes radically, here it seems to be around two clusters.

# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2[1:6], col = km$cluster) 

pairs(boston_scaled2[7:14], col = km$cluster) 

I plotted the clusters in two separate plots, since the figures are otherwise two small for me to see in laptop, but this way I don’t see all the pairs. Overall it looks like two clusters works nicely in most of the pairs.

Super-Bonus:

matrix product

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# color is the crime classes of exercise set: classes is determined earlier as classes <- as.numeric(train$crime) 
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)

Chapter 5) Dimensionality reduction techniques

date()
## [1] "Mon Dec  5 09:34:23 2022"

Overview of the data

The variables of this dataset are:
edu2_fm - Proportion of females with at least secondary education divided by the proportion of males with at least secondary education
labofm - Proportion of females in the labour force divided by the proportion of males in labour force
edu_exp - Expected years of schooling
life_exp - life expectancy
gni - Gross National Income per capita
mat_mor - Maternal mortality ratio
ado_birth - Adolescent birth rate
parli_perc - Percetange of female representatives in parliament

library(tidyr)
library(dplyr)
library(GGally)
library(ggplot2)
library(corrplot)
library(FactoMineR)

human <- read.csv("C:/Users/labpaavo/IODS-project/data/human.csv")

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ edu2_fm   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labofm    : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ edu_exp   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ life_exp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni       : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mat_mor   : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ ado_birth : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parli_perc: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8
summary(human)
##     edu2_fm           labofm          edu_exp         life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni            mat_mor         ado_birth        parli_perc   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
p <- ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
p

cor(human) %>% corrplot

From the data we can see that life expectancy is skewed towards right, whereas GNI, maternal mortality ratio and adolescent birth rate are heavily skewed to left. Many variables are strongly correlated with each other, except for gender ratio in labor force (labofm) and gender ratio in parliment.

PCA on raw data

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

We can see that with raw data the 1st principal component accounts for basically all the variation (99,999%). The figure shows that the variable responsible for this is GNI.

PCA on standardized data

human_std <- scale(human)
pca_human_std <- prcomp(human_std)

biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))

summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

Now PC1 accounts for ~54% of variation compared to 99.99% previously. This is because the raw values of GNI vary from ~20000 to ~500, whereas i.e., labofm and edu2_fm can only have values between 0 and 1. Thus, if we only use absolute values, GNI will account for practically all the variation in the data. This can be resolved by standardizing the data, which makes the changes in variables more comparable.

Interpretation of the first two PC dimensions

It seems that the in the first two principal component dimensions the variables form three ‘groups’. Maternal mortality ratio and adolescent birth rate to seem to be close to each other, meaning that the two are closely related to each other in this data. Another pair can be seen between gender ratio in labor force and the percentage of females in parliment. The rest of the variables are grouped together.

Tea dataset

tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", 
                  sep = ",", header = T)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : chr  "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
##  $ tea.time        : chr  "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
##  $ evening         : chr  "Not.evening" "Not.evening" "evening" "Not.evening" ...
##  $ lunch           : chr  "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
##  $ dinner          : chr  "Not.dinner" "Not.dinner" "dinner" "dinner" ...
##  $ always          : chr  "Not.always" "Not.always" "Not.always" "Not.always" ...
##  $ home            : chr  "home" "home" "home" "home" ...
##  $ work            : chr  "Not.work" "Not.work" "work" "Not.work" ...
##  $ tearoom         : chr  "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
##  $ friends         : chr  "Not.friends" "Not.friends" "friends" "Not.friends" ...
##  $ resto           : chr  "Not.resto" "Not.resto" "resto" "Not.resto" ...
##  $ pub             : chr  "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
##  $ Tea             : chr  "black" "black" "Earl Grey" "Earl Grey" ...
##  $ How             : chr  "alone" "milk" "alone" "alone" ...
##  $ sugar           : chr  "sugar" "No.sugar" "No.sugar" "sugar" ...
##  $ how             : chr  "tea bag" "tea bag" "tea bag" "tea bag" ...
##  $ where           : chr  "chain store" "chain store" "chain store" "chain store" ...
##  $ price           : chr  "p_unknown" "p_variable" "p_variable" "p_variable" ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : chr  "M" "F" "F" "M" ...
##  $ SPC             : chr  "middle" "middle" "other worker" "student" ...
##  $ Sport           : chr  "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
##  $ age_Q           : chr  "35-44" "45-59" "45-59" "15-24" ...
##  $ frequency       : chr  "1/day" "1/day" "+2/day" "1/day" ...
##  $ escape.exoticism: chr  "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
##  $ spirituality    : chr  "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
##  $ healthy         : chr  "healthy" "healthy" "healthy" "healthy" ...
##  $ diuretic        : chr  "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
##  $ friendliness    : chr  "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
##  $ iron.absorption : chr  "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
##  $ feminine        : chr  "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
##  $ sophisticated   : chr  "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
##  $ slimming        : chr  "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
##  $ exciting        : chr  "No.exciting" "exciting" "No.exciting" "No.exciting" ...
##  $ relaxing        : chr  "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
##  $ effect.on.health: chr  "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...
#View(tea)
ind <- 1:ncol(tea)
tea[, ind] <- lapply(tea[, ind], as.factor)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : Factor w/ 61 levels "15","17","18",..: 24 30 32 8 33 6 22 21 25 22 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
library(Hmisc)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
hist.data.frame(tea[1:9])

hist.data.frame(tea[10:18])

hist.data.frame(tea[19:28])

hist.data.frame(tea[29:36])

MCA on tea data

The categorical variables chosen for the MCA were: Tea (black, earl grey, green), How (alone, lemon, milk, other), how (tea bag, tea bag + unpackagerd, unpackaged), sugar (no sugar, sugar), where (chain store, tea shop, chain store + tea shop) and lunch (lunch, not lunch).

# select some columns 
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

plot(mca, invisible=c("var"), graph.type = "classic")

The MCA factor map shows gactors that are close to each other in the data. We can see that tea bought at tea shop is usually unpackaged. Another group can be seen with variables that describe incosistent behavior: buying tea from both chain stores and tea shops is related to having both tea bags and unpackaged teas as well as non-consistent way of drinking tea (other). The other figure shows the individuals in the MCA factor map.